In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
plt.rcParams['figure.figsize'] = 8, 8
plt.rcParams['axes.grid'] = True

It seems likely that we can calibrate our predictions to improve our performance, as described in this blog post. First, though, we should plot the reliability curve for a typical classifier.


In [2]:
cd ..


/home/gavin/repositories/hail-seizure

In [3]:
from python import utils
import pickle

In [7]:
import sklearn.linear_model
import sklearn.cross_validation
import sklearn.metrics

In [4]:
with open("fsdetailed.pickle","rb") as f:
    predictions,labels,weights,segments = pickle.load(f)

In [5]:
x,y = utils.reliability_plot(predictions,labels)
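utils.reliability_plot lives in this repository's python/utils.py. I haven't reproduced it here, but a minimal sketch of the binned-reliability calculation it presumably does (assuming ten equal-width bins; the real function may differ) would be:

def reliability_sketch(predictions, labels, n_bins=10):
    # Assign each prediction to an equal-width bin on [0, 1], then record the
    # mean predicted probability and the observed fraction of positives per bin.
    edges = np.linspace(0, 1, n_bins + 1)
    which_bin = np.digitize(predictions, edges[1:-1])
    x = np.array([predictions[which_bin == i].mean() for i in range(n_bins)])
    y = np.array([labels[which_bin == i].mean() for i in range(n_bins)])
    # (empty bins would give NaN here; presumably the real implementation handles that)
    return x, y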

In [8]:
sklearn.metrics.roc_auc_score(labels,predictions)


Out[8]:
0.86384377771661014

In [9]:
sklearn.metrics.log_loss(labels,predictions)


Out[9]:
0.25863190142133302

In [10]:
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')


Out[10]:
[<matplotlib.lines.Line2D at 0x7fbd6054ba20>]

The above is not good. But that might be good news for us: if we can improve the calibration, we have a good chance of improving our score.

Worth a try, anyway.

Platt scaling

As it says on the blog:

You essentially create a new data set that has the same labels, but with one dimension (the output of the SVM). You then train on this new data set, and feed the output of the SVM as the input to this calibration method, which returns a probability. In Platt’s case, we are essentially just performing logistic regression on the output of the SVM with respect to the true class labels.
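Concretely, Platt scaling fits a sigmoid to the raw scores $f$:

$$P(y = 1 \mid f) = \frac{1}{1 + \exp(A f + B)}$$

with $A$ and $B$ fitted to the true labels. scikit-learn's LogisticRegression on the one-dimensional score fits exactly this family of functions (with a little L2 regularisation thrown in by default).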

Doing this is pretty easy for us.


In [11]:
lr = sklearn.linear_model.LogisticRegression()

In [12]:
# cross validation again
prds_cal = []
labels_cal = []
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels)
for train,test in cv:
    lr.fit(predictions[np.newaxis].T[train],labels[train])
    prds_cal.append(lr.predict_proba(predictions[np.newaxis].T[test])[:,1])
    labels_cal.append(labels[test])
# stack results
prds_cal = np.hstack(prds_cal[:])
labels_cal = np.hstack(labels_cal[:])

AUC has not changed significantly:


In [13]:
sklearn.metrics.roc_auc_score(labels_cal,prds_cal)


Out[13]:
0.85999665348341958

In [14]:
sklearn.metrics.log_loss(labels_cal,prds_cal)


Out[14]:
0.22326414847102732

In [15]:
x,y = utils.reliability_plot(prds_cal,labels_cal)
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')


Out[15]:
[<matplotlib.lines.Line2D at 0x7fbd6020f2e8>]

Isotonic regression

This is very similar to Platt scaling:

The second popular method of calibrating is isotonic regression. The idea is to fit a piecewise-constant non-decreasing function instead of logistic regression. Piecewise-constant non-decreasing means a stair-like shape...

Luckily this is also implemented in scikit-learn, so it's fairly easy to switch in.
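Before switching it in, a quick toy fit (made-up data, purely illustrative) shows the stair-like shape the quote describes:

import sklearn.isotonic
rng = np.random.RandomState(0)
x_toy = np.sort(rng.rand(30))
# noisy binary labels that are (roughly) increasing in x
y_toy = (x_toy + 0.2 * rng.randn(30) > 0.5).astype(float)
iso_demo = sklearn.isotonic.IsotonicRegression(out_of_bounds='clip')
y_step = iso_demo.fit_transform(x_toy, y_toy)  # piecewise-constant, non-decreasing
plt.step(x_toy, y_step, where='post')
plt.plot(x_toy, y_toy, '.')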


In [16]:
import sklearn.isotonic

In [17]:
iso = sklearn.isotonic.IsotonicRegression(out_of_bounds='clip')

In [18]:
prds_cal = []
labels_cal = []
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels)
for train,test in cv:
    iso.fit(predictions[train],labels[train])
    prds_cal.append(iso.transform(predictions[test]))
    labels_cal.append(labels[test])
# stack results
prds_cal = np.hstack(prds_cal[:])
labels_cal = np.hstack(labels_cal[:])


/usr/lib/python3.4/site-packages/scipy/interpolate/interpolate.py:445: RuntimeWarning: invalid value encountered in true_divide
  slope = (y_hi - y_lo) / (x_hi - x_lo)[:, None]

In [19]:
x,y = utils.reliability_plot(prds_cal,labels_cal)
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')


Out[19]:
[<matplotlib.lines.Line2D at 0x7fbd601d9128>]

There appears to be a problem with isotonic regression, or with the way I'm using it. The RuntimeWarning above (an invalid value in the interpolation's slope calculation, presumably a division by zero) hints at it: the calibrated predictions we're getting out contain NaNs:


In [20]:
delinds = []
for i, (label, prediction) in enumerate(zip(labels_cal,prds_cal)):
    if not np.isfinite(prediction):
        print(i,prediction)
        delinds.append(i)


292 nan
1070 nan
1273 nan
1663 nan
2243 nan
3793 nan
5211 nan
5349 nan
5484 nan
5583 nan
5734 nan
6107 nan
6396 nan
6558 nan
6577 nan
7250 nan
8639 nan
9117 nan

In [21]:
# drop the NaN predictions before scoring
mask = np.ones(len(prds_cal), dtype=bool)
mask[delinds] = False
prds_cal = prds_cal[mask]
labels_cal = labels_cal[mask]

In [22]:
sklearn.metrics.roc_auc_score(labels_cal,prds_cal)


Out[22]:
0.86515017560119012

In [23]:
sklearn.metrics.log_loss(labels_cal,prds_cal)


Out[23]:
0.19109615646043029

So this method leaves the AUC essentially unchanged (once the NaN predictions are dropped), improves the log loss, and appears to smooth out the predictions. In the blog post they saw approximately the same thing, but the log loss improvement doesn't necessarily matter to us, because we're not being scored on it.

Platt scaling with subject knowledge

We might expect that Platt scaling will work better if it can tell which subject is which and bias its output appropriately. So, we can append a 1-of-k (one-hot) subject feature to the single-element prediction vector.

We don't have an explicit subject label for each of the predictions we're looking at here, but the segment names we loaded alongside them contain the subject, so we can recover it from those.


In [24]:
import json

In [25]:
with open("forestselection_gavin.json") as f:
    settings = json.load(f)

In [26]:
subject_1ofk = []
for segment in segments:
    fvector = np.zeros(len(settings['SUBJECTS']))
    for i,subject in enumerate(settings['SUBJECTS']):
        if subject in segment:
            # set appropriate element to 1
            fvector[i]=1
    subject_1ofk.append(np.array(fvector)[np.newaxis])
subject_1ofk = np.vstack(subject_1ofk)
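(Equivalently, and a little more compactly, assuming each segment name contains exactly one of the subject strings:)

subjects = settings['SUBJECTS']
subject_1ofk_alt = np.array([[float(subject in segment) for subject in subjects]
                             for segment in segments])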

In [27]:
subject_1ofk.shape


Out[27]:
(9406, 7)

In [28]:
predictions.shape


Out[28]:
(9406,)

In [29]:
predictions = predictions[np.newaxis].T
predictions.shape


Out[29]:
(9406, 1)

In [30]:
X = np.hstack([predictions,subject_1ofk])

In [31]:
# cross validation again
prds_cal = []
labels_cal = []
cv = sklearn.cross_validation.StratifiedShuffleSplit(labels)
for train,test in cv:
    lr.fit(X[train],labels[train])
    prds_cal.append(lr.predict_proba(X[test])[:,1])
    labels_cal.append(labels[test])
# stack results
prds_cal = np.hstack(prds_cal[:])
labels_cal = np.hstack(labels_cal[:])

In [32]:
x,y = utils.reliability_plot(prds_cal,labels_cal)
plt.plot(x,y)
plt.xlabel("Binned mean predicted")
plt.ylabel("Binned fraction positive")
plt.plot([0, 1], [0, 1], 'k--')


Out[32]:
[<matplotlib.lines.Line2D at 0x7fbd5feab8d0>]

AUC is improved, but log loss suffers; for comparison, here is the uncalibrated baseline again:


In [33]:
sklearn.metrics.roc_auc_score(labels,predictions)


Out[33]:
0.86384377771661014

In [34]:
sklearn.metrics.log_loss(labels,predictions)


Out[34]:
0.25863190142133302

But what does it do to the submission AUC?

Let's talk turkey.

The above results are from one of the highest-scoring models (I forgot to switch back in the v2 features). We can simply load the predictions it made in its submission CSV and transform them with the fitted logistic regression model we've created.

First, fit the logistic regression model to all the training data:


In [35]:
lr.fit(X,labels)


Out[35]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

Then load the submission csv:


In [36]:
import csv

In [37]:
testsegments = []
testpredictions = []
with open("output/forestselection_gavin_submission_using__v3_feats.csv") as f:
    c = csv.reader(f)
    # skip first line
    header = next(c)
    for line in c:
        testsegments.append(line[0])
        testpredictions.append(line[1])
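(For reference, the same load with pandas; column positions rather than names are used so nothing about the header beyond its existence is assumed:)

import pandas as pd
df = pd.read_csv("output/forestselection_gavin_submission_using__v3_feats.csv")
header = list(df.columns)
testsegments = df.iloc[:, 0].tolist()
testpredictions = df.iloc[:, 1].astype(float).values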

Build the required 1-of-k subject feature:


In [38]:
subject_1ofk = []
for segment in testsegments:
    fvector = np.zeros(len(settings['SUBJECTS']))
    for i,subject in enumerate(settings['SUBJECTS']):
        if subject in segment:
            # set appropriate element to 1
            fvector[i]=1
    subject_1ofk.append(np.array(fvector)[np.newaxis])
subject_1ofk = np.vstack(subject_1ofk)

In [39]:
testpredictions = np.array(testpredictions)

In [40]:
testpredictions = testpredictions[np.newaxis].T
testpredictions.shape


Out[40]:
(3935, 1)

In [41]:
testpredictions = testpredictions.astype(float)

In [42]:
subject_1ofk.shape


Out[42]:
(3935, 7)

In [43]:
X = np.hstack([testpredictions,subject_1ofk])

In [44]:
ptest = lr.predict_proba(X)[:,1]

In [45]:
ptest


Out[45]:
array([ 0.06069915,  0.05362842,  0.16231353, ...,  0.05899654,
        0.03058018,  0.05368407])

And save these new predictions to a file:


In [46]:
with open("output/platt_scaled_1ofk_forestselection_v3.csv","w") as f:
    c = csv.writer(f)
    c.writerow(header)
    for segment, prediction in zip(testsegments,ptest):
        c.writerow([segment,prediction])

In [53]:
!wc -l output/platt_scaled_1ofk_forestselection_v3.csv


3936 output/platt_scaled_1ofk_forestselection_v3.csv

In [54]:
!head output/platt_scaled_1ofk_forestselection_v3.csv


Submitted, and it scored 0.69822. Not really an improvement.

Trying with vanilla Platt scaling as above:


In [49]:
lr.fit(predictions,labels)


Out[49]:
LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, penalty='l2', random_state=None, tol=0.0001)

In [50]:
ptest = lr.predict_proba(testpredictions)[:,1]

In [51]:
with open("output/platt_scaled_forestselection_v3.csv","w") as f:
    c = csv.writer(f)
    c.writerow(header)
    for segment, prediction in zip(testsegments,ptest):
        c.writerow([segment,prediction])

In [52]:
!head output/platt_scaled_forestselection_v3.csv


Submitted, and it scored exactly the same as without Platt scaling: 0.78169.

Worth checking that it actually changed the probabilities:


In [55]:
!head output/forestselection_gavin_submission_using__v3_feats.csv


Yeah, there are differences; the probabilities have simply been rescaled. Platt scaling of a single score is a monotonic transformation, and AUC depends only on the ranking of the predictions, so the submission score comes out exactly the same.
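As a quick sanity check on that reasoning (made-up data, using the imports from earlier): any strictly increasing transform of the scores leaves the ROC AUC untouched.

rng = np.random.RandomState(0)
y_demo = rng.randint(0, 2, 1000)
scores_demo = rng.rand(1000)
# an arbitrary increasing sigmoid, standing in for a fitted Platt scaler
calibrated_demo = 1.0 / (1.0 + np.exp(-(3.0 * scores_demo - 1.5)))
print(np.isclose(sklearn.metrics.roc_auc_score(y_demo, scores_demo),
                 sklearn.metrics.roc_auc_score(y_demo, calibrated_demo)))  # True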